5023B

Data Science for Biologists

Dr Philip Leftwich

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.


Teach Genetics, Programming, and Statistics

Introductions

Outline of the course

  • Advanced linear modelling
  • Power analysis
  • Data reproducibility
  • R programming
  • Machine Learning

Expectations

  • One workshop per week

  • One lecture per week

  • One assignment per week

  • One ‘capstone’ project

What to expect during this course


I hope you end up with more questions than answers!


Reproducible Research

What is Reproducibility?

Figure: Introduction to Reproducible Research (The Turing Way Community, CC BY licence)

  • For research to be reproducible, both the data and the methods should be available.

  • Applying the described methods to the data should then lead to the same results.

Methods

  • In theory, making methods available does not have to mean sharing code

  • But with complex data and analyses - are methods of data collection enough?

Self-correcting Science?

  • Science advances incrementally by identifying and rectifying errors over time

  • Peer review: critical evaluation of papers by experts maintains quality

  • Independent studies either support or fail to replicate findings

Self-correcting Science?

  • Publication bias: preference for positive results

  • Pressure to publish

  • Poor study designs and statistical issues

  • Lack of transparency

Reproducibility trial:

246 biologists get different results from same data sets

Figure: Forest plots of meta-analytic estimated standardized effect sizes (Zr, blue triangles) and their 95% confidence intervals for each effect size included in the meta-analysis model. (A) Blue tit analyses: points where Zr is less than 0 indicate analyses that found a negative relationship between sibling number and nestling growth. (B) Eucalyptus analyses: points where Zr is less than 0 indicate a negative relationship between grass cover and Eucalyptus seedling success.

Reproducibility Crisis

  • The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.

  • High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.

Crisis as an Opportunity

  • Recognition that no study should be considered ‘definitive’

  • Empower lasting systemic change through increased transparency in research methods, data sharing and reporting

  • Structural change in academic culture

Open Science

Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.

UKRN

  • UK Reproducibility Network - funded by UK Research Council

  • 46 member institutions (UEA is one)

  • Aims to establish open research practices across UK research

  • https://www.ukrn.org/

Project management

Tidy projects

/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R

Organised projects

  • README

  • Documented

  • Easy to code with

  • All files are inside the root folder
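As a minimal sketch of what "all files inside the root folder" can look like, the Python snippet below builds a project skeleton in a temporary directory. The folder names (`data/raw`, `data/processed`, `scripts`, `analysis`, `figures`), the `create_project` helper and the README text are illustrative assumptions rather than a prescribed layout; the same structure works equally well for an R project.

```python
from pathlib import Path
import tempfile


def create_project(root: Path) -> list[Path]:
    """Create a minimal project skeleton under `root` and return the folders made."""
    # Hypothetical folder layout: adapt the names to your own project.
    subdirs = ["data/raw", "data/processed", "scripts", "analysis", "figures"]
    created = []
    for sub in subdirs:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)

    # A README at the root documents what the project is and how to run it.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Project title\n\nWhat this project does and how to run it.\n")
    return created


# Build the skeleton in a throwaway location so nothing real is overwritten.
demo_root = Path(tempfile.mkdtemp()) / "paper"
create_project(demo_root)

# List everything relative to the root folder.
for p in sorted(demo_root.rglob("*")):
    print(p.relative_to(demo_root))
```

Because every path is built relative to the root folder, the project can be moved or shared without editing any file paths.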

R projects

Slugs

  • A short, human-readable string of characters that identifies a file and describes its contents

What do you think these files contain?

  • data/raw/madrid_minimum-temperature.csv

  • scripts/02_compute_mean-temperature.R

  • analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd
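One way to see why consistent slugs are useful is that a program can read them too. The Python sketch below assumes a convention that the examples above appear to follow but that the slides do not state outright: underscores separate fields, hyphens join the words within a field, and a leading number gives the run order. The `describe_slug` helper is hypothetical.

```python
from pathlib import PurePosixPath


def describe_slug(path: str) -> dict:
    """Split a file path into folder, run order, descriptive fields and extension."""
    p = PurePosixPath(path)
    fields = p.stem.split("_")  # assumed: underscores separate fields

    # Assumed: an all-digit first field is a run-order prefix such as "02".
    order = None
    if fields[0].isdigit():
        order = int(fields[0])
        fields = fields[1:]

    return {
        "folder": str(p.parent),
        "order": order,
        # Assumed: hyphens join the words inside a single field.
        "fields": [field.replace("-", " ") for field in fields],
        "extension": p.suffix,
    }


for example in [
    "data/raw/madrid_minimum-temperature.csv",
    "scripts/02_compute_mean-temperature.R",
    "analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd",
]:
    print(describe_slug(example))
```

A name such as `figure 2 (copy).png` gives a parser like this nothing to work with, which is a reasonable sign that it gives a human reader very little either.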

Name files

Come up with good names for these:

  • a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.

  • a script that downloads data from Spotify.

  • a script that cleans up data.

  • a script that fits a linear discriminant model and saves it to a file.

R projects and clean slates

R projects

  • Use projects

  • Check your code runs on blank slates

Quarto

  • Automates the creation of a paper or report

  • Saves time

  • Reduces errors

copy-paste

(https://www.nature.com/articles/d41586-022-00563-z)

Git

Git repository

Git collab

Forking

Renv

Benefits

Resources

Additional resources

  • Discovering Statistics - Andy Field

  • Happy Git and GitHub for the useR - Jenny Bryan

  • An Introduction to Generalized Linear Models - Dobson & Barnett

  • An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani

  • Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.

  • Ecological Statistics: Contemporary Theory and Application

  • The Big Book of R (https://www.bigbookofr.com/)

  • British Ecological Society Guides to Better Science

  • SORTEE (Society for Open, Reliable, and Transparent Ecology and Evolutionary Biology)

Reading list
